Introduction to Bayesian Models

Steve Elston

01/13/2021

Review

The concepts of likelihood and maximum likelihood estimation (MLE) have been at the core of much of statistical modeling for about 100 years

Review

Statistical inference seeks to characterize the uncertainty in estimates

Review

Bootstrap estimation is widely useful and requires minimal assumptions

Review

There are several variations of the basic nonparametric bootstrap algorithm

Review

Re-sampling methods are general and powerful, but there is no magic involved! There are pitfalls!

Introduction

Despite their long history, Bayesian models were not used extensively until recently

Introduction

Bayesian analysis stands in contrast to frequentist methods

Bayesian Model Use Case

Bayesian methods made global headlines with the successful location of the missing Air France Flight 447

Posterior distribution of locations of Air France 447

Bayesian vs. Frequentist Views

With greater computational power and general acceptance, Bayes methods are now widely used

Bayesian vs. Frequentist Views

We can compare the contrasting frequentist and Bayesian approaches

Comparison of frequentist and Bayes methods

Review of Bayes Theorem

Bayes’ Theorem is fundamental to Bayesian data analysis.

\[P(A \cap B) = P(A|B) P(B) \]

We can also write:

\[P(A \cap B) = P(B|A) P(A) \]

Eliminating \(P(A \cap B):\)

\[ P(B)P(A|B) = P(A)P(B|A)\]

Or, Bayes theorem!

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
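As a quick numerical sanity check of this identity, consider a small joint distribution for two binary events; the probabilities below are invented purely for illustration:

```python
# Numerical check of Bayes theorem for a made-up joint distribution.
# p_AB[i][j] = P(A = i, B = j) for binary events A and B.
p_AB = [[0.1, 0.2],
        [0.3, 0.4]]

p_A1 = p_AB[1][0] + p_AB[1][1]        # marginal P(A = 1) = 0.7
p_B1 = p_AB[0][1] + p_AB[1][1]        # marginal P(B = 1) = 0.6
p_A1_given_B1 = p_AB[1][1] / p_B1     # P(A = 1 | B = 1)
p_B1_given_A1 = p_AB[1][1] / p_A1     # P(B = 1 | A = 1)

# Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
print(abs(p_A1_given_B1 - p_B1_given_A1 * p_A1 / p_B1) < 1e-12)  # True
```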

Bayes Theorem

Bayes Theorem!

Marginal Distributions

In many cases we are interested in the marginal distribution

\[p(\theta_1) = \int_{\theta_2, \ldots, \theta_n} p(\theta_1, \theta_2, \ldots, \theta_n)\ d\theta_2, \ldots, d\theta_n\]

- But computing this integral is not easy!

Marginal Distributions

\[ p(\theta) = \sum_{x \in \mathbf{X}} p(\theta |\mathbf{X})\ p(\mathbf{X}) \]

\[ p(\mathbf{X}) = \sum_{\theta \in \Theta} p(\mathbf{X} |\theta) p(\theta) \]
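For discrete parameters these marginal sums can be computed directly. A minimal sketch with a hypothetical two-value parameter \(\theta\) and invented prior and likelihood values:

```python
import numpy as np

# Hypothetical discrete example: theta takes two values, 0 and 1.
prior = np.array([0.5, 0.5])        # p(theta = 0), p(theta = 1)
likelihood = np.array([1.0, 0.25])  # p(X | theta = 0), p(X | theta = 1)

# Marginal (evidence): p(X) = sum over theta of p(X | theta) p(theta)
p_X = np.sum(likelihood * prior)
print(p_X)  # 0.625
```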

Interpreting Bayes Theorem

How can you interpret Bayes’ Theorem?

\[Posterior\ Distribution = \frac{Likelihood \bullet Prior\ Distribution}{Evidence} \]

\[ posterior\ distribution(parameters\ |\ data) = \\ \frac{Likelihood(data\ |\ parameters)\ Prior(parameters)}{P(data)} \]

\[ P(parameters\ |\ data) = \frac{P(data\ |\ parameters)\ P(parameters)}{P(data)} \]

Interpreting Bayes Theorem

What do these terms actually mean?

  1. Posterior distribution of the parameters given the evidence or data, the goal of Bayesian analysis

  2. Prior distribution is chosen to express information available about the model parameters a priori

  3. Likelihood is the conditional distribution of the data given the model parameters

  4. Data or evidence is the distribution of the data and normalizes the posterior

These relationships apply to the parameters in a model: partial slopes, intercept, error distributions, lasso constants, etc.

Applying Bayes Theorem

We need a tractable formulation of Bayes Theorem for computational problems

\[ P(B \cap A) = P(B|A)P(A) \\ And \\ P(B) = P(B \cap A) + P(B \cap \bar{A}) \]

Where, \(\bar{A} = not\ A\), and the marginal distribution, \(P(B)\), can be written:

\[ P(B) = P(B|A)P(A) + P(B|\bar{A})P(\bar{A}) \]

Applying Bayes Theorem

Using the foregoing relations we can rewrite Bayes Theorem as:

\[ P(A|B) = \frac{P(A)P(B|A)}{P(B|A)P(A) + P(B|\bar{A})P(\bar{A})} \]

Defining a normalization constant \(k\), we can rewrite Bayes Theorem as:

\[P(A|B) = k \cdot P(B|A)P(A)\]

Ignoring the normalization constant \(k\):

\[P(A|B) \propto P(B|A)P(A)\]

Interpreting Bayes Theorem

Denominator must account for all possible outcomes, or alternative hypotheses, \(h'\):

\[Posterior(hypothesis\ |\ evidence) =\\ \frac{Likelihood(evidence\ |\ hypothesis)\ prior(hypothesis)}{\sum_{ h' \in\ All\ possible\ hypotheses}Likelihood(evidence\ |\ h')\ prior(h')}\]

This is a formidable problem!

Bayes Theorem Example

Hemophilia is a serious genetic condition expressed on any X chromosome

Bayes Theorem Example

As evidence the woman has two sons (not identical twins) with no expression of hemophilia

\[ p(x_1=0, x_2=0 | \theta = 1) = 0.5 * 0.5 = 0.25 \\ p(x_1=0, x_2=0 | \theta = 0) = 1.0 * 1.0 = 1.0 \]

Note: we are neglecting the possibility of a mutation in one of the sons

Bayes Theorem Example

Use Bayes theorem to compute the probability that the woman carries an X chromosome with hemophilia expression, \(\theta = 1\)

\[ p(\theta=1 | X) = \frac{p(X|\theta=1) p(\theta=1)}{p(X|\theta=1) p(\theta=1) + p(X|\theta=0) p(\theta=0)} \\ = \frac{0.25 * 0.5}{0.25 * 0.5 + 1.0 * 0.5} = 0.20 \]

The evidence of two sons without hemophilia causes us to update our belief about the probability that the woman carries the disease, from the prior of 0.5 down to 0.2

Note: The denominator is the sum over all possible hypotheses, the marginal distribution of the observations \(\mathbf{X}\)
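The whole calculation takes only a few lines of Python; this sketch simply restates the numbers from the slides:

```python
# Hemophilia example: theta = 1 means the woman is a carrier.
prior_carrier = 0.5          # p(theta = 1)
prior_not = 0.5              # p(theta = 0)

# Likelihood of two unaffected sons under each hypothesis
lik_carrier = 0.5 * 0.5      # p(x1=0, x2=0 | theta = 1)
lik_not = 1.0 * 1.0          # p(x1=0, x2=0 | theta = 0)

# Denominator marginalizes over both hypotheses
evidence = lik_carrier * prior_carrier + lik_not * prior_not
posterior_carrier = lik_carrier * prior_carrier / evidence
print(posterior_carrier)  # 0.2
```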

Simplified Relationship for Bayes Theorem

How do we interpret the foregoing relationship?

\[Posterior\ Distribution \propto Likelihood \bullet Prior\ Distribution \\ Or \\ P(parameters\ |\ data) \propto P(data\ |\ parameters)\ P(parameters) \]

Creating Bayes models

The goal of a Bayesian analysis is computing and performing inference on the posterior distribution of the model parameters

The general steps are as follows:

  1. Identify data relevant to the research question

  2. Define a sampling plan for the data. Data need not be collected in a single batch

  3. Define the model and the likelihood function; e.g. regression model with Normal likelihood

  4. Specify a prior distribution of the model parameters

  5. Use the Bayesian inference formula to compute posterior distribution of the model parameters

  6. Update the posterior as data is observed

  7. Perform inference on the posterior; e.g. compute credible intervals

  8. Optionally, simulate data values from realizations of the posterior distribution. These values are predictions from the model.
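Steps 4 through 6 can be sketched for a conjugate case. This example uses the standard Beta-Binomial conjugate update, \(Beta(a + z,\ b + n - z)\), with invented batch counts:

```python
# Minimal sketch of specifying a prior and updating the posterior in
# batches as new binomial data arrive (batch counts are hypothetical).
a, b = 1.0, 1.0                       # flat Beta(1, 1) prior
batches = [(3, 10), (7, 25), (2, 5)]  # (successes, trials) per batch

for z, n in batches:
    a, b = a + z, b + n - z           # posterior becomes the next prior

print(a, b)  # 13.0 29.0
```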

Updating Bayesian Models

An advantage of Bayesian models is that they can be updated as new observations are made

How can you choose a prior?

The choice of the prior is a difficult, and potentially vexing, problem when performing Bayesian analysis

How can you choose a prior?

How can you use prior empirical information to estimate the parameters of the prior distribution?

Conjugate Prior Distributions

An analytically and computationally simple choice for a prior distribution family is a conjugate prior

Conjugate Prior Distributions

Most commonly used distributions have conjugates, with a few examples:

| Likelihood | Conjugate |
|------------|-----------|
| Binomial | Beta |
| Bernoulli | Beta |
| Poisson | Gamma |
| Categorical | Dirichlet |
| Normal - mean | Normal |
| Normal - variance | Inverse Gamma |
| Normal - inverse variance, \(\tau\) | Gamma |

Example using Conjugate Distribution

We are interested in analyzing the incidence of distracted drivers

\[ P(k) = \binom{n}{k} \cdot \theta^k(1-\theta)^{n-k}\]

Our process is:

  1. Use the conjugate prior, the Beta distribution with parameters \(\alpha\) and \(\beta\)
  2. Using the data sample, compute the likelihood
  3. Compute the posterior distribution of distracted driving
  4. Add more evidence (data) and update the posterior distribution.

Example using Conjugate Distribution

What are the properties of the Beta distribution?

Beta distribution for different parameter values

Example using Conjugate Distribution

Consider the product of a Binomial likelihood and a Beta prior

\[\begin{align} posterior(\theta | z, n) &= \frac{likelihood(z,n | \theta)\ prior(\theta)}{data\ distribution (z,n)} \\ p(\theta | z, n) &= \frac{Binomial(z,n | \theta)\ Beta(\theta)}{p(z,n)} \\ &= Beta(z + a -1,\ n-z+b-1) \end{align}\]

Example using Conjugate Distribution

There are some useful insights you can gain from this relationship:

\[ posterior(\theta | z, n) = Beta(z + a -1,\ n-z+b-1) \]

- Evidence is also in the form of (actual) counts of successes, \(z\), and failures, \(n-z\)
- The more evidence, the greater the influence on the posterior distribution
- A large amount of evidence will overwhelm the prior
- With a large amount of evidence, the posterior converges to the frequentist model

Example using Conjugate Distribution

Consider example with:
- Prior pseudo counts \([1,9]\), successes \(a = 1 + 1\) and failures, \(b = 9 + 1\)
- Evidence, successes \(= 10\) and failures, \(= 30\)
- Posterior is \(Beta(10 + 2 -1,\ 40 - 10 + 10 -1) = Beta(11,\ 39)\)
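Computing the posterior parameters and posterior mean, following the slide's parameter convention:

```python
# Distracted-driving example, using the slide's convention
# posterior = Beta(z + a - 1, n - z + b - 1).
a, b = 2, 10              # prior: pseudo counts [1, 9] plus one
z, n = 10, 40             # evidence: 10 successes in 40 trials

post_a = z + a - 1        # 11
post_b = n - z + b - 1    # 39

# Mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta)
post_mean = post_a / (post_a + post_b)
print(post_a, post_b, post_mean)  # 11 39 0.22
```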

Prior, likelihood and posterior for distracted driving

Sampling the Posterior

How can we find an estimate of the posterior distribution?

  1. We can sample from the analytic solution - if we have a conjugate

  2. We can sample the likelihood and prior, take the product and normalize - for any posterior

  3. Grid sample or Markov chain Monte Carlo (MCMC) sample

Sampling the Posterior

Grid sampling is a naive approach

Sampling grid for bivariate distribution

Sampling the Posterior

Algorithm for grid sampling to compute posterior from likelihood and prior

Procedure CreateGrid(variables, lower_limits, upper_limits):
    # Build the sampling grid
    return sampling_grid

Procedure SampleLikelihood(sampling_value, observation_values):
    return likelihood_function(sampling_value, observation_values)

Procedure Prior(sampling_value, prior_parameter_values):
    return prior_density_function(sampling_value, prior_parameter_values)

Procedure ComputePosterior(variables, lower_limits, upper_limits):
    # Initialize the sampling grid
    Grid = CreateGrid(variables, lower_limits, upper_limits)

    # Initialize array to hold sampled posterior values
    array posterior[range(Grid)]

    # Compute unnormalized posterior at each sampling value in the grid
    for sampling_value in Grid:
        likelihood = SampleLikelihood(sampling_value, observation_values)
        prior = Prior(sampling_value, prior_parameter_values)
        posterior[sampling_value] = likelihood * prior

    # Normalize the posterior
    probability_data = sum(posterior[range(Grid)])
    posterior = posterior[range(Grid)] / probability_data
    return posterior
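A runnable Python sketch of this algorithm for a one-dimensional case, a Binomial likelihood with a Beta prior (the grid limits and prior parameters are chosen for illustration):

```python
import numpy as np
from math import comb

z, n = 10, 40   # observed successes and trials
a, b = 2, 10    # assumed Beta prior parameters

# 1. Create the sampling grid over theta in (0, 1)
grid = np.linspace(0.001, 0.999, 999)

# 2. Likelihood and (unnormalized) prior density at each grid point
likelihood = comb(n, z) * grid**z * (1.0 - grid)**(n - z)
prior = grid**(a - 1) * (1.0 - grid)**(b - 1)

# 3. Unnormalized posterior, then normalize so it sums to 1
posterior = likelihood * prior
posterior = posterior / posterior.sum()

# Posterior mean approximated on the grid
post_mean = float(np.sum(grid * posterior))
print(round(post_mean, 3))
```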

Credible Intervals

How can we specify the uncertainty for a Bayesian parameter estimate?

Credible Intervals

What are the 95% credible intervals for \(Beta(11,\ 39)\)?
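One simple way to approximate an equal-tailed 95% credible interval is Monte Carlo sampling from the posterior; exact quantiles could instead be computed from the Beta distribution directly, so this is just a sketch:

```python
import numpy as np

# Equal-tailed 95% credible interval for the Beta(11, 39) posterior,
# estimated from posterior samples.
rng = np.random.default_rng(42)
samples = rng.beta(11, 39, size=100_000)
lower, upper = np.percentile(samples, [2.5, 97.5])
print(round(lower, 3), round(upper, 3))
```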

Probability of distracted drivers for the next 10 cars

Credible Intervals are not Confidence Intervals

How are credible intervals different from the more familiar confidence intervals?

Confidence intervals and credible intervals are conceptually quite different

A confidence interval is a purely frequentist concept
- Is an interval on the sampling distribution where repeated samples of a statistic are expected with probability \(= \alpha\)
- Cannot interpret a confidence interval as an interval on a probability distribution of the value of a statistic!

Credible interval is an interval on a posterior distribution of the statistic
- Credible interval is exactly what the misinterpretation of the confidence interval tries to be
- Credible interval is the interval with highest \(\alpha\) probability for the statistic being estimated

Credible Intervals are not Confidence Intervals

Compare confidence interval and credible interval for the case of 10 observations

Difference between credible and confidence intervals

Simulating from the posterior distribution: predictions

What else can we do with a Bayesian posterior distribution beyond credible intervals?

Simulating from the posterior distribution: predictions

Example: What are the probabilities of distracted drivers for the next 10 cars with posterior \(Beta(11,\ 39)\)?
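A posterior predictive simulation can be sketched by drawing \(\theta\) values from the posterior and then simulating Binomial counts for the next 10 cars:

```python
import numpy as np

# Posterior predictive: draw theta from the Beta(11, 39) posterior,
# then simulate the number of distracted drivers among the next 10 cars.
rng = np.random.default_rng(42)
theta = rng.beta(11, 39, size=100_000)
predicted = rng.binomial(10, theta)

# Empirical predictive probabilities for k = 0, ..., 10 distracted drivers
probs = np.bincount(predicted, minlength=11) / len(predicted)
print(np.round(probs, 3))
```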

Probability of distracted drivers for the next 10 cars

Summary

Bayesian analysis stands in contrast to frequentist methods
